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The use of principal component methods to analyze functional 
data is appropriate in a wide range of different settings. In studies of 
"functional data analysis," it has often been assumed that a sample 
of random functions is observed precisely, in the continuum and with- 
out noise. While this has been the traditional setting for functional 
data analysis, in the context of longitudinal data analysis a random 
function typically represents a patient, or subject, who is observed at 
only a small number of randomly distributed points, with nonnegli- 
gible measurement error. Nevertheless, essentially the same methods 
can be used in both these cases, as well as in the vast number of 
settings that lie between them. How is performance affected by the 
sampling plan? In this paper we answer that question. We show that 
if there is a sample of n functions, or subjects, then estimation of 
eigenvalues is a semiparametric problem, with root-n consistent es- 
timators, even if only a few observations are made of each function, 
and if each observation is encumbered by noise. However, estimation 
of eigenfunctions becomes a nonparametric problem when observa- 
tions are sparse. The optimal convergence rates in this case are those 
which pertain to more familiar function-estimation settings. We also 
describe the effects of sampling at regularly spaced points, as opposed 
to random points. In particular, it is shown that there are often ad- 
vantages in sampling randomly. However, even in the case of noisy 
data there is a threshold sampling rate (depending on the number of 
functions treated) above which the rate of sampling (either randomly 
or regularly) has negligible impact on estimator performance, no mat- 
ter whether eigenfunctions or eigenvectors are being estimated. 
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1. Introduction. 

1.1. Connections between FDA and LDA. Advances in modern technol- 
ogy, including computing environments, have facilitated the collection and 
analysis of high-dimensional data, or data that are repeated measurements 
of the same subject. If the repeated measurements are taken over a period 
of time, say on an interval J? , there are generally two different approaches 
to treating them, depending on whether the measurements are available on 
a dense grid of time points, or whether they are recorded relatively sparsely. 

When the data are recorded densely over time, often by machine, they 
are typically termed functional or curve data, with one observed curve (or 
function) per subject. This is often the case even when the data are observed 
with experimental error, since the operation of smoothing data recorded at 
closely spaced time points can greatly reduce the effects of noise. In such 
cases we may regard the entire curve for the ith subject, represented by the 
graph of the function Xi{t) say, as being observed in the continuum, even 
though in reality the recording times are discrete. The statistical analysis of 
a sample of n such graphs is commonly termed functional data analysis, or 
FDA, and can be explored as suggested in the monographs by Ramsay and 
Silverman [27, 28]. 

Biomedical longitudinal studies are similar to FDA in important respects, 
except that it is rare to observe the entire curve. Measurements are often 
taken only at a few scattered time points, which vary among subjects. If we 
represent the observation times for subject i by random variables Tij, for 
j = 1, . . . ,mj, then the resulting data are (Xi(Tn), . . . , Aj(Tj mi )), generally 
observed with noise. The study of information in this form is often referred 
to as longitudinal data analysis, or LDA. See, for example, [12] or [20]. 

Despite the intrinsic similarities between sampling plans for functional 
and longitudinal data, statistical approaches to analyzing them are gener- 
ally distinct. Parametric technologies, such as generalized estimating equa- 
tions or generalized linear mixed effects models, have been the dominant 
methods for longitudinal data, while nonparametric approaches are typi- 
cally employed for functional data. These and related issues are discussed 
by Rice [31]. 

A significant, intrinsic difference between the two settings lies in the per- 
ception that functional data are observed in the continuum, without noise, 
whereas longitudinal data are observed at sparsely distributed time points 
and are often subject to experimental error. However, functional data are 
sometimes computed after smoothing noisy observations that are made at 
a relatively small number of time points, perhaps only a dozen points if, 
for example, full-year data curves are calculated from monthly figures (see, 
e.g., [26]). Such instances indicate that the differences between the two data 
types relate to the way in which a problem is perceived and are arguably 
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more conceptual than actual — for example, in the case of FDA, as one where 
discretely recorded data are more readily understood as observations of a 
continuous process. 

As this discussion suggests, in view of these close connections, there is 
a need to understand the interface between FDA and LDA views of data 
that might reasonably be thought of as having a functional origin. This is 
one of the goals of the present paper. In the context of principal compo- 
nent analysis, we explain the effect that observation at discrete time points, 
rather than observation in the continuum, has on statistical estimators. In 
particular, and as we shall show, estimators of the eigenvalues 9j of principal 
components can be root-n consistent even when as few as two observations 
are made of each of the n subjects, and even if experimental error is present. 
However, in such cases, estimation of eigenfunctions ipj is at slower rates, 
but nevertheless at rates that would be optimal for function estimators if 
data on those functions were observed in conventional form. On the other 
hand, when the n random functions are fully observed in the continuum, the 
convergence rates of both eigenvalue and eigenfunction estimators are n" 1 / 2 . 

These results can be summarized by stating that estimation of 8j or of ipj 
is a semiparametric problem when the random functions are fully observed 
in the continuum, but that estimation of ipj is a nonparametric problem, 
whereas estimation of 6j remains semiparametric, when data are observed 
sparsely with noise. Indeed, if the number of observations per subject is at 
least two but is bounded, and if the covariance function of subjects has r 
bounded derivatives, then the minimax-optimal, mean square convergence 
rate of eigenfunction estimators is n~ 2r ^ 2r+1 \ This rate is achieved by esti- 
mators based on empirical spectral decomposition. We shall treat in detail 
only the case r = 2, since that setting corresponds to estimation of covari- 
ance using popular local smoothing methods. However, straightforward ar- 
guments give the extension to general r. 

We also identify and discuss the important differences between sampling 
at regularly spaced, and at random time points. Additionally we address 
the case where the number of sampled points per subject increases with 
sample size. Here we show that there is a threshold value rate of increase 
which ensures that estimators of 6j and of tpj are first-order equivalent to 
their counterparts in the case where subjects are fully observed, without 
noise. By drawing connections between FDA, where nonparametric methods 
are well developed and popular, and LDA, where parametric techniques 
play a dominant role, we are able to point to ways in which nonparametric 
methodology may be introduced to LDA. There is a range of settings in LDA 
where parametric models are difficult to postulate. This is especially true 
when the longitudinal data do not have similar "shapes," or are so sparse 
that the individual data profiles cannot be discerned. Measurement errors 
can also mask the shapes of the underlying subject profiles. Thus, more 
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flexible models based on nonparametric approaches are called for, at least at 
the early stage of data analysis. This motivates the application of functional 
data approaches, and in particular, functional principal component analysis, 
to longitudinal data. 

It might be thought that our analysis of the infinite-dimensional problem 
of FDA should reveal the same phenomena that are apparent in "large p, 
small n" theory for finite-dimensional problems. For example, estimators of 
the maximum eigenvalue might be expected to be asymptotically biased. See, 
for example, Johnstone [18] . However, these features are not present in the- 
oretical studies of conventional FDA methodology (see, e.g., [4, 11]), where 
complete functions are observed, and it is arguably unsurprising that they 
are not present. One reason is that, although FDA is infinite-dimensional, 
an essential ingredient distinguishing it from the multivariate vector case 
is smoothness. The problem of estimating any number of eigenvalues and 
eigenfunctions does not become successively more difficult as sample size 
increases, since this problem in some sense may be reduced to that of es- 
timating fixed smooth mean and covariance functions from the available 
functional data. In contrast, in typical "large p, small n" asymptotics, the 
dimension of covariance matrices is assumed to increase with sample size 
which gives rise to specific properties. 

The results in the present paper represent the first attempt at develop- 
ing concise asymptotic theory and optimal rates of convergence describing 
functional PC A for sparse data. Upper bounds for rates of convergence of 
estimated eigenfunctions in the sparse-data case, but not attaining the con- 
cise convergence rates given in the present paper, were developed by Yao, 
Miiller and Wang [38] under more restrictive assumptions. Other available 
theoretical results for functional PC A are for the ideal situation when entire 
random functions are observed, including Dauxois, Pousse and Romain [11], 
Bosq [3], Pezzulli and Silverman [25], Boente and Fraiman [2], Cardot [7], 
Girard [14] and Hall and Hosseini-Nasab [15]. There is an extensive literature 
on the general statistical analysis of functional data when the full functions 
are assumed known. It includes work of Besse and Ramsay [1], Castro, Law- 
ton and Sylvestre [10], Rice and Silverman [32], Brumback and Rice [5] and 
Cardot, Ferraty and Sarda [8, 9], as well as many articles discussed and cited 
by Ramsay and Silverman [27, 28]. Kneip and Utikal [21] used methods of 
functional data analysis to assess the variability of densities for data sets 
from different populations. Contributions to various aspects of the analysis 
of sparse functional data, including longitudinal data observed with mea- 
surement error, include those of Shi, Weiss and Taylor [34], Staniswalis and 
Lee [35], James, Hastie and Sugar [17], Rice and Wu [33] and Miiller [24]. 
For practical issues of implementing and applying functional PCA, we refer 
to [6, 19, 29, 32, 37, 38]. 
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2. Functional PCA for discretely observed random functions. 

2.1. Functional principal component analysis. Let X±, . . . , X n denote in- 
dependent and identically distributed random functions on a compact in- 
terval J 1 ', satisfying E(X 2 ) < oo. The mean function is /i = E(X), and 
the covariance function is ip(u,v) = cov{X(u), X(v)}. Functional PCA is 
based on interpreting tp as the kernel of a linear mapping on the space 
L2{J ! ) of square-integrable functions on J 1 , taking a to ipa defined by 
(tpa)(u) = Jjr a(v )ip(u, v) dv. For economy we use the same notation for an 
operator and its kernel. Mercer's theorem (e.g., [16], Chapter 4) now implies 
a spectral decomposition of the function tp, 

oo 

(2.1) ^(u,v) = ^e j if> j (u)^ j (v), 

J=l 

where 6\ > 62 > ■ ■ ■ > are ordered values of the eigenvalues of the operator 
tp, and the ipj's are the corresponding eigenfunctions. 

The eigenfunctions form a complete orthonormal sequence on L2(J?), and 
so we may represent each function X{ — \i in terms of its generalized Fourier 
expansion in the ipj's, 

OO 

(2.2) X i (t) = fi(t) + J2&M t )> 

where Qj = J '^{Xi-^ipj is referred to as the jth functional principal com- 
ponent score, or random effect, of the ith subject, whose (observed as in 
FDA or hidden as in LDA) random trajectory is Xj. The expansion (2.2) is 
referred to as the Karhunen-Loeve or functional principal component expan- 
sion of the stochastic process Xi . The fact that tpj and ipk are orthogonal for 
j ^ k implies that the random variables Qj, for 1 < j < 00, are uncorrelated. 

Although the convergence in (2.2) is in L2, not pointwise in t, the only 
purpose of that result, from the viewpoint of this paper, is to define the 
principal components, or individual effects, Qj. The values of those random 
variables are defined by (2.2), with probability 1. 

The difficulty of representing distributions of random functions means 
that principal component analysis assumes even greater importance in the 
setting of FDA than it does in more conventional, finite-dimensional statis- 
tical problems. Especially if j is small, the shape of the function tpj con- 
veys interpretable information about the shapes one would be likely to find 
among the curves in the data set X\, . . . ,X n , if the curves were observable. 
In particular, if ip\ has a pronounced turning point in a part of where 
the other functions, with relatively low orders, are mostly flat, then the 
turning point is likely to appear with high probability in a random func- 
tion Xi. The "strength" with which this, or another, feature is likely to 
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arise is proportional to the standard deviation of Qj, that is, to the value 

1/2 

of 9j . Conversely, if all eigenfunctions are close to zero in a given region, 
we may conclude that the underlying random process is constrained to be 
close to its mean in this region with relatively little random variation. These 
considerations and others, including the fact that (2.1) can be used to rep- 
resent a variety of characteristics of the random function X, motivate the 
development of methods for estimating each 6j and each ipj. 

2.2. Estimation. Let X\, . . . ,X n be as in Section 1.1, and assume that 
for each i we observe pairs (T^, Y^-) for 1 <j < im, where 

(2.3) Yij = Xi(Tij) + Eij, 

the "observation times" Tij all lie in the compact interval , the errors Sij 
have zero mean and each rrij > 2. For simplicity, when developing theoret- 
ical properties it will be assumed that the Ty's are identically distributed 
random variables, that the errors e« are also identically distributed, with 
finite variance E(e 2 ) = a 2 , and that the AVs, Tij's and e^'s are totally in- 
dependent. However, similar results may be obtained with a degree of weak 
dependence, and in cases of nonidentical distributions. 

Using the data set V = {(T^, Yij), 1 <j < m;, 1 < i < n}, we wish to 
construct estimators 6j and ipj of 6j and ipj, respectively. We start with 
estimators p, of fi = E(X), and ip of the autocovariance, ip; definitions of 
p, and ip will be given shortly. The function ip, being symmetric, enjoys an 
empirical version of the expansion at (2.1), 

oo 

(2.4) ${u,v) = ^e$j{u)$j{v). 

i=i 

Here, 61,62,- ■ ■ are eigenvalues of the operator tp, given by (ipa)(u) = Jjs a(v) x 
ip(u, v) dv for a € Li^), and tpj is the eigenfunction corresponding to 6j. In 
Section 3 we shall develop properties of 6j and ipj. Given jo>l, the 6j's are 
ordered so that 6\> ■ ■ ■ > 6j > 6j , the last inequality holding for all j > jo . 

The signs of ipj and ipj can be switched without altering either (2.1) 
or (2.4). This does not cause any difficulty, except that, when discussing the 
closeness of tpj and ipj, we clearly want them to have the same parity. That 
is, we would like these eigenvectors to "point in the same general direction" 
when they are close. We ensure this by allowing the sign of ipj to be chosen 
arbitrarily, but asking that the sign of tpj be chosen to minimize \\ipj — ipj\\ 
over both possible choices, where here and in the following || • || denotes the 
La-norm, ||^|| = (/ V> 2 ) 1/2 - 
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We construct first /2(u), and then ip(u, v), by least-squares fitting of a local 
linear model, as follows. Given u £ , let and denote bandwidths and 
select (a, 6) = (a,b) to minimize 



n mi 



Tjj - U 



i=ii=i v ft f 

and take //(it) = a. Then, given u,v & J^, choose (60,61,62) = (00,^1,^2) to 
minimize 

n 

£ Yl {YijY i k-a -b 1 (u-T ij )-b 2 (v-T ik )} 2 

i=l j,k : l<j^k<mi 

Tij - u\ (T ik - v 



The quantity do estimates <p(u,v) = E{X(u)X(v)}, and so we denote it 
by (j)(u, v). Put 

tJj{u,v) =0(u,i;) -fl(u)fl(v). 

These estimates are the same as those proposed in [38], where practical fea- 
tures regarding the implementation are discussed in detail. The emphasis 
in [38] is on estimating the random effects Qj, for which Gaussian assump- 
tions are made on processes and errors. We extend the consistency results 
for eigenvalues and eigenvectors of [38] in four significant ways: First, we 
establish concise first-order properties. Second, the first-order results in the 
present paper imply bounds that are an order of magnitude smaller than the 
upper bounds provided by Yao, Miiller and Wang [38]. Third, we derive the 
asymptotic distribution of estimated eigenvalues. Fourth, we characterize 
a transition where the asymptotics of "longitudinal behavior" with sparse 
and noisy measurements per subject transform into those of "functional be- 
havior" where random trajectories are completely observed. This transition 
occurs as the number of measurements per subject is allowed to increase at 
a certain rate. 

The operator defined by tp is not, in general, positive semidefinite, and 
so the eigenvalues Oj at (2.4) may not all be negative. Nevertheless, ip is 
symmetric, and so (2.4) is assured. 

Define Uij = u - T^, V ik = v- T ik , Z ijk = YijY ik , 



Using this notation we may write 

(2.5) ftu) = S2 ?°~ Sl 3\ 0(u,v) = (A 1 R oo -A 2 R lo -A 3 R ol )B' 

JQD2 — <->l 



<s 
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where 

A\ = S20S02 — Sfx, A2 = SioSqq — SqiSh, A3 = S01S20 — SioSn, 



n mi n mi 

13, 



i=lj=l i=lj=l 



Srs = E E E Rrs = E E E uijV&ZijkWijk, 

ijik:j<k i,j,k:j<k 
B = A\Sqo — ^2'S'lo — A3S01 

= {eEE(^ - ^J{EEE(^ - vfwJ 

v i,j,k:j<k ) \ i,j,k:j<k ) 

v i,j,k:j<k ) 

Q=(EEE q«w*) I (e E E ^) . 

\ i,j,k:j<k / V i,j,k:j<k / 

for Q = U,V. Here we have suppressed the dependence of 5 r , i? r and Wy 
on u, and of A r , B, S rs , R rs and VFp on (tt, w). 

3. Theoretical properties. 

3.1. Main theorems. Our estimators /2 and tp have been constructed by 
local linear smoothing, and so it is natural to make second derivative as- 
sumptions below, as a prerequisite to stating both upper and lower bounds 
to convergence rates. If ju and tp were defined by rth-degree local polynomial 
smoothing, then we would instead assume r derivatives, and in particular 
the optimal L2 convergence rate of ijjj would be n~ r /( 2r+1 ) rather than the 
rate ra -2 / 5 discussed below. 

Assume that the random functions X{ are independent and identically 
distributed as X and are independent of the errors e^-; that the latter are 
independent and identically distributed as e, with E(e) = and E(e 2 ) = a 2 ; 
that 

for each C >0 max E{ sup \X ij) {u)\ C \ + E(\e\ c ) < 00; 
7=0,1,2 y u( zj! J 

that the kernel function K is compactly supported, symmetric and Holder 
continuous; that for an integer jo > 1 there are no ties among the jo + 1 
largest eigenvalues of (j) [although we allow the (jo + l)st largest eigenvalue 
to be tied with the (jo + 2)nd]; that the data pairs (Tij,Yij) are observed for 
1 < j < Trii and 1 < i < n, where each mj > 2 and maxj< n rrii is bounded as 
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n — > oo; that the Tjj's have a common distribution, the density, /, of which is 
bounded away from zero on J^; and that n n ~^ 1 / 2 ^ <h fl = o(l), for some r/ > 0. 
In addition, for parts (a) and (b) of Theorem 1 we assume, respectively, 
that (a) n n ~( 1 /^ 1 < for some rj > 0, max(n _1 / 3 /i^/ 3 , n~ 1 h^ 8 ^ 3 ) = o(/i M ), 

= o(/i^) and h<j, = o(l); and (b) n v ~( 3 / 8 "> < and h ( f, + h^ = o(n -1 / 4 ). 
Call these conditions (C). 

In conditions (C) above we suppose the m^'s to be deterministic, but 
with minor modifications they can be taken to be random variables. Should 
some of the m^'s be equal to 1, these values may be used to estimate the 
mean function, fi, even though they cannot contribute to estimates of the 
covariance. For simplicity we shall not address this case, however. 

Put x = X — /j,, and define k = J K 2 , k 2 = J u 2 K{u) du, 



c(r,s 



f{t)- l ^ r {t)^ s {t)dt, 



(3.1) P(u, v, w, z) = E{x(u)x(v)x(w)x(z)} — ip(u, v)ip(w, z), 

X(U,V) = ^K 2 {lp 20 (u,v) +1p 2(u,v) + fl" (u) (J,(v) + (J,(u) fl" (v)} , 

where ip rs (u,v) = (d r+s /du r dv s )ip(u, v ). Let 

N= \ ^2mi{mi - 1) 

i<n 

and 

Kr,-) = EEEEE^K/( r oO/(ri* 1 )/(r« a )/(r ifa )}- 1 

i=l ji<ki j2<ki 

x f3(Tij 1 , Tik± i ^ij2 1 Tik2 ) 

x Vv (Ti h ) tp r (T ikl ) tp s (T ij2 ) ip s (T ik2 )] . 

This formula has a conceptually simpler, although longer to write, version, 
obtained by noting that the Ty's are independent with density /. Asymp- 
totic bias and variance properties of estimators are determined by the quan- 
tities 

r rc\ ff Ejxjt^xih) 2 } + a 2 2 

(3.2) C 2 = C 2 (j) = E ~ 6 *)~ 2 ( I xMk) , 

(E)„ = N~ 2 {u{r, s) + Na 2 c(r, s) 2 }. 

Let E denote the jo x jo matrix with (r, s)th component equal to (E) rs . 

Note that (E) rs = 0(n~ l ) for each pair (r, s). Write a, 9 and # for the 
vectors (a±, . . . , a,j ) T , . . . , 0j Q ) T and (9\, . . . , 9j ) T , respectively. 
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Our next result describes large-sample properties of eigenvalue and eigen- 
function estimators. It is proved in Section 4. 

Theorem 1. Assume conditions (C). Then, (a) for 1 < j <jo, 

(3.3) - ^ || 2 = ^- + C 2 h% + o p {{nh^y l + h% 

and (b) for any vector a, a T (9 — 6) is asymptotically normally distributed 
with mean zero and variance a T Sa. 

The representation in part (b) of the theorem is borrowed from [13]. 
Bounds on ipj — ipj and on 0j — 9j, which hold uniformly in increasingly 
large numbers of indices j , and in particular for 1 < j < jo = jo (n) where 
jo{n) diverges with n, can also be derived. Results of this nature, where 
the whole functions Xi are observed without error, rather than noisy ob- 
servations being made at scattered times Tij as in (2.3), are given by Hall 
and Hosseini-Nasab [15]. The methods there can be extended to the present 
setting. However, in practice there is arguably not a great deal of interest in 
moderate- to large-indexed eigenfunctions. As Ramsay and Silverman [28] 
note, it can be difficult to interpret all but relatively low-order principal 
component functions. 

The order of magnitude of the right-hand side of (3.3) is minimized by 
taking h x n~ 1//5 ; the relation a n xb n , for positive numbers a n and b n , means 
that a n /b n is bounded away from zero and infinity as n — > oo. Moreover, it 
may be proved that if h X n -1 / 5 , then the relation \\ipj — ipj\\ = O p (n _2//5 ), 
implied by (3.3), holds uniformly over a class of distributions of processes 
X that satisfy a version of conditions (C). The main interest, of course, lies 
in establishing the reverse inequality, uniformly over all candidates ipj for 
estimators of ipj , thereby showing that the convergence rate achieved by ipj 
is minimax-optimal. 

We shall do this in the case where only the first r eigenvalues 9\ , . . . , 6 r 
are nonzero, with fixed values, where the joint distribution of the Karhunen- 
Loeve coefficients Cn^- ■ ■ Xir [see (2.2)] is also fixed, and where the ob- 
servation times Tij are uniformly distributed. These restrictions actually 
strengthen the lower bound result, since they ensure that the "max" part of 
the minimax bound is taken over a relatively small number of options. 

The class of eigenfunctions ipj will be taken to be reasonably rich, however; 
we shall focus next on that aspect. Given c\ > 0, let S(c\) denote the 
Sobolev space of functions <f> on for which max s= o,i,2Sup t6< ^ |<^ s )(t)| < c\. 
We pass from this space to a class * = ^(ci) of vectors tp = (V>i, ■ • ■ ,Vv) of 
orthonormal functions, as follows. Let ipn, ■ ■ ■ , ip\ r denote any fixed functions 
that are orthonormal on J 1 and have two bounded, continuous derivatives 
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there. For each sequence (fi\,...,(fi r of functions in <S(ci), let ifi\,...,ifi r be 
the functions constructed by applying Gram-Schmidt orthonormalization 
to ifi\\ + (/>i, . . . , ipi r + (j> r , working through this sequence in any order but 
nevertheless taking ifij to be the new function obtained on adjoining tfiij + <fij 
to the orthonormal sequence, for 1 < j < r. If c\ is sufficiently small, then, 
for some C2 > 0, 

sup max max sup M^f 1 < Co. 

Moreover, defining A,- to be the class of functions tp2j = ipij + <Aj f° r which 
(fij G «S(ci) and / = 1, we have 

(3.4) ^( Cl ) C{^:(^i,...,Vr) G^} 

for 1 < j < r. In the discussion below we shall assume that these properties 
hold. 

Let 9\ > ■ ■ ■ > 9 r > be fixed, and take = 6 r +i = 9 r +2 = ■••■ Let Ci> • • • > Cr 
be independent random variables with continuous distributions, all mo- 
ments finite, zero means, and E(Q) = 9j for 1 < j < r. Assume we ob- 
serve data Yij = JQ(Ty) + £j,-, where 1 < i < n, 1 < j < m, m>2 is fixed, 

x i = Ei<Kr dkifij, (ipl, ■ ■ ■ ,ip r ) € each Ci = (Cil) • ■ • . Cir) is distributed as 
(Ci, • ■ • , Cr), each Tjj is uniformly distributed on J? = [0, 1], the e^-'s are iden- 
tically normally distributed with zero mean and nonzero variance, and the 
Ci's, Tjj's and e^'s are totally independent. Let vl/j denote the class of all 
measurable functionals tfij of the data T> = {(Tij,Yij), 1 < i < n, 1 <j < m}. 
Theorem 2 below asserts the minimax optimality in this setting of the L2 
convergence rate n _2//5 for tfij given by Theorem 1. 

Theorem 2. For the above prescription of the data V, and assuming 
h^n~ 1 ^, 

(3.5) lim limsup max sup P(\\ifij — ifij\\ > Cn~ 2 ^) = 0; 
and for some C > 0, 

(3.6) liminf min inf sup P{MAV) - ifi- II > Cn~ 2/5 } > 0. 

n— »oo Kj<rl e i, r T 

It is possible to formulate a version of (3.5) where, although the maximum 
over j continues to be in the finite range 1 < j < r, the supremum over ip G \l/ 
is replaced by a supremum over a class of infinite-dimensional models. There 
one fixes 9± > ■ ■ ■ > 6 r > 9 r+ \ > 9 r+ 2 > ■ ■ ■ > 0, and chooses Vv+i, Vv+2 5 • • • by 
extension of the process used to select ifi\ , . . . , ip r . 
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3.2. Discussion. Part (b) of Theorem 1 gives the asymptotic joint dis- 
tribution of the components 6j, and from part (a) we may deduce that the 
asymptotically optimal choice of for estimating ipj is ~ (C1/4C2-/V) 1 / 5 x 
n" 1 / 5 . More generally if x n" 1 / 5 , then by Theorem 1 — tjjj\\ = O p (n~ 2 / 5 ) 
By Theorem 2 this convergence rate is asymptotically optimal under the as- 
sumption that ip has two derivatives. For Ha, of size n -1 / 5 , the conditions 
on imposed for part (a) of Theorem 1 reduce to n~ 7 / 15 = o(/i^) and 

V = (?l- 1 /5). 

In particular, Theorem 1 argues that a degree of undersmoothing is nec- 
essary for estimating tpj and 9j. Even when estimating ipj, the choice of 
can be viewed as undersmoothed, since the value const, n -1 / 5 , suggested by 
(3.3), is an order of magnitude smaller than the value that would be optimal 
for estimating (f>; there the appropriate size of is n" 1 / 6 . The suggested 
choice of is also an undersmoothing choice, for estimating both tpj and 6j. 

Undersmoothing, in connection with nonparametric nuisance components, 
is known to be necessary in situations where a parametric component of a 
semiparametric model is to be estimated relatively accurately. Examples 
include the partial-spline model studied by Rice [30] , and extensions to lon- 
gitudinal data discussed by Lin and Carroll [22]. In our functional PCA 
problem, where the eigenvalues and eigenfunctions are the primary targets, 
the mean and covariance functions are nuisance components. The fact that 
they should be undersmoothed reflects the cases mentioned just above, al- 
though the fact that one of the targets is semiparametric, and the other 
nonparametric, is a point of departure. 

The assumptions made about the mj's and T^'s in Theorems 1 and 2 
are realistic for sparsely sampled subjects, such as those encountered in 
longitudinal data analysis. There, the time points Tij typically represent 
the dates of biomedical follow-up studies, where only a few follow-up visits 
are scheduled for each patient, and at time points that are convenient for 
that person. The result is a small total number of measurements per subject, 
made at irregularly spaced points. 

On the other hand, for machine-recorded functional data the m^s are 
usually larger and the observation times are often regularly spaced. Neither 
of the two theorems is valid if the observation times are of this type, 
rather than (as in the theorems) located at random points. For example, 
if each rrii = m and we observe each Xi only at the points = j/m, for 
1 < j < m, then we cannot consistently estimate either 8j or ipj from the 
resulting data, even if no noise is present. There exist infinitely many distinct 
distributions of random functions X, in particular of Gaussian processes, 
for which the joint distribution of X(j/m), for 1 < j < m, is common. The 
stochastic nature of the Tjj's, which allows them to take values arbitrarily 
close to any given point in J^, is critical to Theorems 1 and 2. Nevertheless, 
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if the rnj's increase sufficiently quickly with n, then regular spacing is not a 
problem. We shall discuss this point in detail shortly. 

Next we consider how the results reported in Theorem 1 alter when each 
rrii>m and m increases. However, we continue to take the Tj,-'s to be random 
variables, considered to be uniform on for simplicity. In this setting, 
N > \m(m — l)n and u(r, s) = ^m^ndfr, s) + 0(m 3 n), where 



If we assume each m.j = m, then it follows that (S) rs = n~ 1 d(r, s) + 0{(mn)~ 1 }, 
as m,n—> oo. The leading term here, that is, n~ 1 d(r,s), equals the (r, s)th 
component of the limiting covariance matrix of the conventional estimator 
of ,0j o ) T when the full curves Xi are observed without noise. [It may 

be shown that the proof leading to part (b) of Theorem 1 remains valid in 
this setting, where each rrtj = m and m = m(n) — > oo as n increases.] This 
reflects the fact that the noisy sparse-data, or LDA, estimators of eigenval- 
ues converge to their no-noise and full-function, or FDA, counterparts as 
the number of observations per subject increases, no matter how slow the 
rate of increase. 

The effect on tjjj of increasing m is not quite as clear from part (a) of 
Theorem 1. That result implies only that — ipj || 2 = C^br^ + o p {(nh ( j,)~ 1 }. 
Therefore the order of magnitude of the variance component is reduced, and 
a faster L2 convergence rate of ijjj to tpj can be achieved by choosing 
somewhat smaller than before. However, additional detail is absent. 

To obtain further information it is instructive to consider specifically 
the "FDA approach" when full curves are not observed. There, a smooth 
function estimator Xi of Xi would be constructed by passing a statistical 
smoother through the sparse data set T>i = {(Tij,Yij), 1 < j < mi}, with Yij 
given by (2.3). Functional data analysis would then proceed as though Xi 
were the true curve, observed in its entirety. The step of constructing Xi is 
of course a function estimation one, and should take account of the likely 
smoothness of Xi. For example, if each Xi had r derivatives, then a local 
polynomial smoother of degree r— 1 might be passed through Pj. (Kneip and 
Utikal [21] also employed a conventional smoother, this time derived from 
kernel density estimation, in their exploration of the use of functional-data 
methods for assessing different population densities.) 

Let us take r = 2 for definiteness, in keeping with the assumptions leading 
to Theorems 1 and 2, and construct Xi by running a local-linear smoother 
through T>i. Assume that each m, > m. Then the following informally stated 
property may be proved: The smoothed function estimators Xi are as good 
as the true functions Xi, in the sense that the resulting estimators of both 
6j and ipj are first-order equivalent to the root-n consistent estimators that 




f3(u, v, w, z)ip r (u)ijj r (v)ip s (w)'i/j s (z) du dv dw dz. 



14 



P. HALL, H.-G. MULLER AND J.-L. WANG 



arise on applying conventional principal component analysis to the true 
curves Xi, provided m is of larger order than n 1 / 4 . Moreover, this result 
holds true for both randomly distributed observation times Tij and regu- 
larly spaced times. A formal statement and outline proof of this result are 
given in Section 3.4. 

These results clarify issues that are sometimes raised in FDA and LDA, 
about whether effects of "the curse of dimensionality" have an impact through 
the number of observations per subject. It can be seen from our results that 
having rrii large is a blessing rather than a curse; even in the presence of 
noise, statistical smoothing successfully exploits the high-dimensional char- 
acter of the data and fills in the gaps between adjacent observation times. 

Theorem 1 provides advice on how the bandwidth might be varied 
for different eigenfunctions ipj. It suggests that, while the order of magni- 
tude of hfi need not depend on j, the constant multiplier could, in many 
instances, be increased with j. The latter suggestion is indicated by the fact 
that, while the constant C\ in (3.3) will generally not increase quickly with 
j, C2 will often tend to increase relatively quickly, owing to the spacings 
between neighboring eigenvalues decreasing with increasing j. The connec- 
tion to spacings is mathematically clear from (3.2), where it is seen that 
by decreasing the values of 9j — 8k we increase Ci. Operationally, it is ob- 
served that higher-order empirical eigenfunctions are typically increasingly 
oscillatory, and hence require more smoothing for effective estimation. 

3.3. Proof of Theorem 2. The upper bound (3.5) may be derived using 
the argument in Section 4. To obtain (3.6) it is sufficient to show that if 
j G [l,r] is fixed and the orthonormal sequence {ipi, • ■ • ,ip r } is constructed 
starting from ipj € Aj[c\); and if, in addition to the data V, the values of 
each Qk, for 1 < i < n and 1 < k < m, and of each ipk(Tu), for 1 < i < n, 
k^ j and 1 < I < m, are known; then for some Cj > 0, 

liminf inf sup P{\UJV) - ipA\ > C,-n- 2/5 | > 0. 

[To obtain this equivalence we have used (3.4).] That is, if we are given only 
the data V = {(T^, Qj,ipj(Tij) + EijC,^ 1 ), 1 < i < n, 1 < j < m}, and if #j 
denotes the class of measurable functions ipj of T>\ then it suffices to show 
that for some Cj > 0, 

liminf inf sup P{\\i>j{V) - ipj\\ > C,-n _2/5 } > 0. 

Except for the fact that the errors here are ^ijCij 1 rather than simply e^-, 
this result is standard; see, for example, [36]. The factor Q^- 1 is readily dealt 
with by using a subsidiary argument. 
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3.4. Random function approximation. In Section 3.2 we discussed an ap- 
proach to functional PCA that was based on running a local-linear smoother 
through increasingly dense, but noisy, data on the true function Xi, produc- 
ing an empirical approximation Xj. Here we give a formal statement and 
outline proof of the result discussed there. 

Theorem 3. Suppose each rrii > m, and assume conditions (C) from 
Section 3.1, except that the observation times T{j might be regularly spaced 
on rather than being randomly distributed there. Estimate 9j and ipj us- 
ing conventional PCA for functional data, as though each smoothed function 
estimator Xi really were the function Xi. Then the resulting estimators of 
6j and ipj are root-n consistent, and first-order equivalent to the conven- 
tional estimators that we would construct if the X{" 's were directly observed, 
provided m = m{n) diverges with n and the bandwidth, h, used for the local- 
linear smoother satisfies h = o(n -1 / 4 ), mhn~ Sl — » oo and m 1_(52 /t — » oo, for 
some 61,62 > 0. 

We close with a proof. Observe that the estimator of ip{u,v) that results 
from operating as though each Xi is the true function Xi, is 

${u,v) = -J2{Xi(u) - X(u)}{X t (v) - X(v)}, 

n r~ 

i=i 

where X = n _1 ^jXj. The linear operator that is defined in terms of ip 
is positive semideflnite. The FDA-type estimators 9j and ipj of 9j and ijjj, 
respectively, would be constructed by simply identifying terms in the corre- 
sponding spectral expansion, 

00 

?{,(u,v) = J2&jipj(u)ipj(v), 

3=1 

where 9± > 92 > ■ - ■ >0. Of course, if we were able to observe the process Xi 
directly, without noise, we would estimate ip using 

1 71 

i>{u, v) = - - A(u)}{A^) - X{v)}, 

where X = n~ x J2i Xi> an d take as our estimators of 9j and ipj the corre- 
sponding terms 9j and ipj in the expansion, 

00 

ip(u,v) = ^9jipj(u)ipj(v), 
3=1 

with d x > 9 2 > • • • > 0. 
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Methods used to derive limit theory for the estimators 6j and tpj (see, 
e.g., [15]) may be used to show that the estimator pairs (6j,tpj) and (6j,tpj) 
are asymptotically equivalent to first order if tp — tp = o p (n _1//2 ), but gen- 
erally not first-order equivalent if tp and tp differ in terms of size n -1 / 2 or 
larger. Here, distances are measured in terms of the conventional L2 metric 
for functions. Since we have used a local-linear smoother to construct the 
functions Xi from the data T>, then the bias contribution to tp — tp is of 
size h 2 , where h denotes the bandwidth for the local-linear method. The 
contribution from the error about the mean is of size (mn/i) -1 / 2 at each 
fixed point. The "penalty" to be paid for extending uniformly to all points 
is smaller than any polynomial in n. Indeed, using an approximation on a 
lattice that is of polynomial fineness, the order of magnitude of the uniform 
error about the mean can be seen to be of order n s (mnh)" 1 ^ 2 for each 5 > 0. 
Theorem 3 follows. 

4. Proof of Theorem 1. 

Step (i): Approximation lemmas. Let tp denote a symmetric, strictly 
positive-definite linear operator on the class £2^) of square-integrable 
functions from the compact interval to the real line, with a kernel, also 
denoted by tp, having spectral decomposition given by (2.1). Denote by tp 
another symmetric, linear operator on Z^^)- Write the spectral decompo- 
sition of tp as tp(u,v) =J2j>i@j' t Pj( u )'4>j( v )- Since tp is nonsingular, then its 
eigenfunctions tpj, appearing at (2.1), comprise a complete orthonormal se- 
quence, and so we may write tpj = J2k>i djk^Pk, f° r constants cijk satisfying 
J2k>i^jk = 1- We- may choose ajj to be either positive or negative, since 
altering the sign of an eigenfunction does not change the spectral decompo- 
sition. Below, we adopt the convention that each djj > 0. 

Given a function a on J^ 2 , define ||a|| = (/ fj? a 2 ) 1 / 2 , ||a||oo = sup \a\ and 

||c>!||(j) = J ij ol{u, v)tpj (v) tin. 
If a\ and «2 are functions on J? write / aaia^ to denote 

a(u, v)a\ (u)a.2 {v) du dv. 



'2 



For example, J(tp — ip)tpjipj in (4.2), below, is to be interpreted in this way. 
Let / aa\ denote the function of which the value at u is Jjr a(u, v)a±(v ) dv, 
and write |^| for the length of J> . 



Lemma 1. For each j > 1, 

„/, . I|2 _ Of1 _ T, 

'33 h 



(4.1) W^i-^i II 2 = 2(1 -a 
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(4.2) 



'3 v 3 

< \J 



(^-^Vi + (!-<•) 1 



(1 - a 



2 

33 



+ 21 



i>3 



Iplpjlp 



lor 



Lemma 1 implies that knowing bounds for 1 — a,-,- and for several norms 
of ip — ip gives us information about the sizes of — ipj\\ and 6j — 9j. 
We shall take ip = ip, in which case we have an explicit formula for — ip\\. 
Therefore our immediate need is for an approximation to djj , denoted below 
by djj when ip = ip. This requirement will be filled by the next lemma. 
Define A = ip — ip, and let djk denote the generalized Fourier coefficients for 
expressing ipj in terms of the ipks: *Pj = J2k>i ^jk^Pk, where we take djj > 0. 



Lemma 2. Under the conditions of Theorem 1, 



(4.3) 
(4.4) 



1-a 



&2 33 



i+ E(% 

k:kjtj 



Alpj1p k 



O p (\\A\\1 



Op(||A||||A| 



Lemma 1 is derived by using basic manipulations in operator theory. The 
proof of Lemma 2 involves more tedious arguments, which can be considered 
to be sparse-data versions of methods employed by Bosq [4] to derive his 
Theorem 4.7 and Corollaries 4.7 and 4.8; see also [23]. 



Step (ii): Implications of approximation lemmas. Since 2(1 — a 
33 Op(\l ~ a ^j\ 2 ) an d l|A||( 7 ) < ||A||, then Lemma 2 implies that 



■33 > 



1 



33' 



(4.5) 
where 



2(l-a ii ) = I>ii + Op(||A||||A||J 7 . ) ) > 
D 3i = E (°3 ~ °k)~ 2 (/ N>j*l>k) < const. || A| 



2 

or 



k : k^j 

Note too that 

(4.6) D j1 = D j2 + 

where 



IAII 2 



Aip jip 



3^3 ' 



(4.7) 



D 



32 



E m 

k:k^j 
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Standard arguments on uniform convergence of nonparametric function es- 
timators can be used to show that, under the conditions of Theorem 1, 
~ A* I |oo = Op(l) and ||0 — 4>\\oo = o p (l), from which it follows that \\ip — 
V'lloo = o p (l). Combining (4.5) and (4.6) with (4.1)-(4.4) we deduce that 

lift - V-j 

(4.8) 

(4.9) §j - 

Let E' denote expectation conditional on the observation times Ty, for 
1 < j < rrii and 1 < i < n. Standard methods may be used to prove that, 
under the bandwidth conditions imposed in either part of Theorem 1, and 
for each r) > 0, 

E'\\ A|| 2 = OpHnhlr 1 + (nh^y 1 + h 4 }, 

and -E"||A|| 2 ,^ = O v {(nh)~ l + /i 4 }, where h = /i^ + h^. Therefore, under the 
bandwidth assumptions made, respectively, for parts (a) and (b) of The- 
orem 1, the "Op" remainder term on the right-hand side of (4.8) equals 
Op{(nh,p)~ l + h^} + O p {(n/i M )~ 3 / 2 + h^}, while the remainder on the right- 
hand side of (4.9) equals o p {n~ 1 / 2 ). Hence, 

lift - V-j 

(4.10) 

(4.11) §j - 

Step (hi): Approximations to A. We may Taylor-expand Xi(Tij) about 
Xi(u), obtaining 

(4.12) XiiTij) = Xi{u) - UijX&u) + lU? j X"{u ij ), 

where Uij = u — TL- and the random variable Uj,- lies between u and 
and is of course independent of the errors e rs . For given u,v G J^, define 
43 = {Xi{u) + e t ,}{X t (v) + e tk }, V ik = v - T tk , zf k = ^X'^X^v) + 
VtkXiMX&v) and zf k = C/ 4J X i '(n)e jfc + V ik Xi(v)e jk . Let ^ denote the 

version of <f) obtained on replacing Zy k by Z^j,. In the following calculations, 
we may set /x = 0, without loss of generality. Using (4.12), and its analogue 



= A-2 + e- 2 ||A|| ) -0- 2 (/A^ j ) 

+ O p (||A|[||A||^), 
= /A^ + OpdlAQ. 



= ^ 2 + ^ 2 ||A|| ) -0- 2 (/A^ j ) 

+ opl^)" 1 + fcj} + O p {(n^)- 3 / 2 + h% 
= J A^j + opin^ 2 ). 
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for expansion about Xi(v) rather than Xi(u), we may write 
= {X^u) - UijXKu) + ±*7 2 Xf (^) + % } 
(4.13) x {Xi(v) - VikX^Vij) + \V^X'l(v) + e ik } 

l]K l]K l]K l]K ' 

where is defined by (4.13). Using standard arguments for deriving uni- 
form convergence rates it may be proved that for some rj > 0, and under the 
bandwidth conditions for either part of Theorem 1, 

sup \^\u,v) - E'{^(u,v)}\ = O p {n-^ /2) -^). 

[Note that the data z]Q, from which is computed, contain only quadratic 

terms in (Uij,Vik)- When the kernel weights are applied for constructing 0" ; 
only triples (i,j,k) for which |V^| < const, make a nonvanishing 

contribution to the estimator. This fact ensures the relatively fast rate of 
convergence.] Therefore, uniformly on J^ 2 , and under either set of bandwidth 
conditions, 

- E% = (AW - E'fiV _ (0[2] _ E'ft^ _ (£[3] _ E /^[3] ) 

(4.14) 

+ o P (n" 1/2 ). 

Put Y-j (u) = Xi(u) + Eij, and let juM denote the version of Ji obtained on 

replacing Yij by Y[j\ Define fift by ju = juM + /2^. Conventional arguments 
for deriving uniform convergence rates may be employed to show that for 
some rj > 0, and under either set of bandwidth conditions, 

(4.15) sup |/2 [1] (u) -E'{^{ U )}\ =O p (n- (1 ^- r i), 

(4.16) sup \^ 2 \u) -E'{^ 2 \u)}\ =O p (n- (1 ^- r i), 

where (4.15) makes use of the property that hn > rf 1 ^^ 1 ^ 2 ' 1 for some rj > 0. 
Combining (4.14)-(4.16) we deduce that, for either set of bandwidth condi- 
tions, 

(4.17) A = A + A 1 -A 2 -A3-A4-A 5 + A 6 , 
where A fc = <^ - E'fi® for k = 1, 2, 3, 

A (u, v) = E'$(u, v)} - {E'ji{u)}{E'ji{v)} - V>(u, v), 
A A (u,v)=E'{n(u)}{j2M(v)-E'n [1] (v)}, 
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Az,(u,v) = A4(v,u) and sup^ u ^ e y2 \Aq(u,v)\ = o p (n -1 / 2 ). Direct calcula- 
tion may be used to show that for each ro,SQ > 1, and for either set of 
bandwidth conditions, 



max max E' 

l<K31<r,s<r 

max max E' 

fc=2,31<r,s<r 



1 2 

&.k{u,v)il) r {u)il) s (v) dudv > = O p (n 1 ), 

1 2 

Afc(n, v )^ r (u)ifj s (v) dudv > = o p (n x ), 



max supE'l / / Ak{u,v)il) r (u)il) s (v)dudv\ = 0„(n 1 ), 

E'(\\A 1 \\ 2 U) ) = O p {(nh^~ 1 }, 
E'(\\A k \\l ) ) = O p (n~ 1 ), 



max 

fc=2,3 



max max E' 

k=l,2r>l,l<s<s 



max sup E' 

k=l,2 l<r<ro,s>l 



J?1 



E'{n(u)}{i2^(v)-E'^(v)} 
x ip r (u)ip s (v) dudv 
E'{Jl( U )}{n [k] (v)-E'tl [k] (v)} 



x ip r (u)ip s (v ) dudv 



O p {{nh,Y 1 }. 



Standard arguments show that 



sup \A (u,v) - ip(u,v)\ = O p (h 2 ). 
Combining results from (4.17) down, and defining 
(4.18) D j3 = £ {(% " 6 k)~ 2 ~ Of} ( f Ao^Vi 

k:k=ij \ J 

[cf. (4.7)], we deduce from (4.10) and (4.11) that, for the bandwidth condi- 
tions in (a) and (b), respectively, 



^ j f = D j3 + 0J 2 \\A o + A 1 f 



(4.19) 
(4.20) 



Aoipjipj 



+ OpKnh^y 1 + h%} + O p {{nh^/ 2 }, 
9j -6j = J (A + Ai - A 4 - A 5 )^j + o p (ra~ 1/2 ) 
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Step (iv): Elucidation of bias contributions. Let the function x be as de- 
fined at (3.1). It may be proved that E'{<p(u,v)} = 4>(u,v) + h'j > x{ u ^ v ) + 
Op(/t^), uniformly in with \u — v\ > 5 for any 6 > 0, and that 

E' {(fi(u,v)} = O p (/i^), uniformly in u, v € J^. Here one uses the fact that 
E{X(s)X(t)} = ip(s,t) + x(s)x(t); subsequent calculations involve replacing 
(s,i) by {Tij,Tik) on the right-hand side. 

Furthermore, E'{p{u)} = fj.(u) + O p (/i^), uniformly in u £ In view of 
these properties and results given in the previous paragraph, and noting 
that we assume /i M = o(/i<jj) in the context of (4.19), and = o(n~ 1//4 ) for 
(4.20), we may replace Ao by h^x m the definition of Dj^ at (4.18), and in 
(4.19) and (4.20), without affecting the correctness of (4.19) and (4.20). Let 
Dj4 denote the version of Dj% where Ao is replaced by h^x- 

Moment methods may be used to prove that 



(4.21) 
Furthermore 

(4.22) 



h% X + Ai||fo = E'\\h\ X + Ai + OpHnh^y 1 + h%} 

= hl\\x\\ 2 {j) + tf'IIAiH^ + OpKnh^y 1 + h%}. 



2 



i 2 

X^j^k 



k:k^j 

compare (4.6). Combining (4.21) and (4.22) with the results noted in the pre- 
vious paragraph, we deduce that, under the bandwidth conditions assumed 
for parts (a) and (b), respectively, of Theorem 1, 

Ife -^-f = oj 2 E'\\ Ail^ + fcJ £ (0j-e k r 2 ([xMk) 2 

k:k^j VJ 1 

(4.23) 

+ OpKnh^)- 1 + h%} + O p {(nhpy 3 / 2 }, 



(4.24) §j -8j= / (Ai - 2A 4 )^>j + o p {n~ l l 2 ). 



Step (v): Calculation of E'\\Ai\\ 2 jy Since S'||Ai||?-n = (,i(u) du, where 
£i{u)= / / S,2(u,vi,v 2 )^j{vi)^ j (v 2 )dv 1 dv2 



and £2(^1 ^2) = E'{Ai(u, vi)Ai(u, V2)}, then we shall compute £2^ v\, v 2 ). 

Recall the definition of <j) at (2.5). An expression for Ai is the same, ex- 
cept that we replace R rs in the formula for (f> by Q rs , say, which is defined 
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by replacing Zyk, in the definition of R rs in Section 2, by Z^j, — E'(Z^j,). It 
can then be seen that if, in the formulae in the previous paragraph, we re- 
place £2(^1,^2) by ^{u,vi,v 2 ) = E' {A* (u,Vi)A* (u,v 2 )} , where A*(u,v) = 
A±Qoo/B and A± and B are as in Section 2, then we commit an error of 
smaller order than (ra/i^) -1 + in the expression for £7'||Ai||?.n. 

Define x = X — /i and (3{s\, t\, s 2 , t 2 ) = E{x(si)x(ti)x(s 2 )x(t 2 )}. In this 
notation, 

B(u, vi)B(u, ui)£ 3 (tt, v 1 ,v 2 )/Ai(u, v 1 )A 1 (u, v 2 ) 

n 

= ^(-^ii ' -^ifa ' 1 Tik 2 ) 

Contributions to the fivefold series above, other than those for which (ji,k\) = 
(j'2, j'2), make an asymptotically negligible contribution. Omitting such terms, 
the right-hand side of (4.25) becomes 

Multiplying by i)j(vi)i)j(v 2 )Ai(u,vi)Ai{u,v 2 ){B{u,vi)B(u,v 2 )}~ 1 , integrat- 
ing over u,vi,v 2 , and recalling that N = J2i m i{ m i ~ 1)j we deduce that 

E'WAtf^-iNhl)- 1 f f{u)- 2 du 

x III 1 .J^* 1 '* 2 '* 1 '* 2 )-^ 2 } 

x {/(vi)/(^)}- 1 /(ii)/(i 2 ) dh dt 2 dv! dv2 

~ {Nh^y 1 J---J {f{t 1 )f{t 2 )Y 1 {p{t l ,t 2 ,t l ,t 2 ) + <y 2 } 

x ^(s) 2 iT(si)E:(s2)V'i(i2) 2 d*i dt 2 dsi ds 2 

= (Nh^d. 
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Result (3.3) follows from this property and (4.23). 

Step (vi): Limit distribution of Zj = /(Ai — 2Ai)ipjipj . It is straight- 
forward to prove that the vector Z, of which the jth component is Zj, is 
asymptotically normally distributed with zero mean. We conclude our proof 
of part (b) of Theorem 1 by finding its asymptotic covariance matrix, which 
is the same as the limit of the covariance conditional on the set of observa- 
tion times Tij. In this calculation we may, without loss of generality, take 
/x = 0. 

Observe that 

(4.26) cov' (Z r ,Z s ) = c u (r,s) - 2c u (r,s) - 2c u (s,r) +4c 44 (r,s), 
where the dash in cov' denotes conditioning on observation times, and 

c a b(r,s) = J J J J E'{A a (ui,Vl)A b (u2,V2)} 



x Vv^OVv^i)^^)"^^) dui dv\ du2 dv : 



2- 



We shall compute asymptotic formulae for en, C14 and C44. 

Pursuing the argument in step (iv) we may show that E'(AiA\) is asymp- 
totic to 

Ai(u 1 ,v 1 )A 1 (u 2 ,v 2 ) 



B(ui,vi)B(u 2 ,vi) 



E E pc^^ , , t^ 2 , ) 



hrh J V h 



x K { T in-U2 \ K f T lk2 -V2 



h<p J V h 

i=l j<k 



hs J V h, 



h<j) J \ h<j, 

The ratio A\A\jBB to the left is asymptotic to {5'oo(wi, wi)»S'oo( n 2> ^2)} 1 , 
and so to {iV^/M/M/^O/M}- 1 . Therefore, 



cn(r, s) ~ iV~ 



EEEEEU^)/^*)/^)/^)} -1 

i=l jl<fcl j2<^2 
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(4.27) x J J J J i K{w 1 )K(w 2 )K{w 3 )K{w A )dw 1 --- dw A 

n 

+^EEE{/^)/(^)}" 2 ^( T y)^( T *)^( T «i)^( T *) 

i=l j<k 

K {w\)K {w 2 )K {w%)K {w/C) dwi ■ ■ ■ dwi, 



p 



N~ 2 {u{r,s) + Na 2 c{r,s) 2 }. 



Observe next that, defining j(u, v, w) = E{X(u)X(v)X(w)} — 4>(u, v)/i(w), 
it may be shown that Sq(v2)Soq(ui,vi)E' {Ai(ui,vi)A4(u 2 , ^2)} / n{u 2 ) is 
asymptotic to 

n rrii 

£££ £ w mki{ui,vi)W ij2 (v 2 ) 

i=l j 1 <k 1 j 2 =l 

x E'([{Xi(ui) + e ih }{Xi{vi) +e ikl } - 0(«i,ui)] 
x {Xi(v 2 ) +e ij2 - fi{v 2 )}) 

n mi 
i=l ji<fci J2=l 

x {7(1*1, ui,v 2 ) + o- 2 5j U - 2 //(vi) + o- 2 (5 J2fcl /i(ni)}. 

The terms in a 2 can be shown to make asymptotically negligible contribu- 
tions to C14; the first of them vanishes unless u\ is within 0(hj,) of v 2 , and the 
second unless v\ is within 0(h^) of v 2 . Furthermore, defining N\ = Ni(n) = 
Ei<n m i> we have s o(v 2 ) ~ p Nth^fiv^ and Sbo(«i,«i) ~p Nh^ ) f{u 1 )f{v 1 ). 
From these properties, and the fact that we are assuming (without loss of 
generality) that fi = 0, it may be proved that 

(4.28) |cu(r,s)| + |c 44 (r, s)| = o p (?i" 1 ). 



Results (4.24) and (4.26)-(4.28) imply that the covariance matrix in the 
central limit theorem for t 
in part (b) of Theorem 1. 



central limit theorem for the vector of values of 9j — 8j has the form stated 
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